Neighbor-Sensitive Hashing

Authors

  • Yongjoo Park
  • Michael J. Cafarella
  • Barzan Mozafari
Abstract

Approximate kNN (k-nearest neighbor) techniques using binary hash functions are among the most commonly used approaches for overcoming the prohibitive cost of performing exact kNN queries. However, the success of these techniques largely depends on their hash functions’ ability to distinguish kNN items; that is, the kNN items retrieved based on data items’ hashcodes should include as many true kNN items as possible. A widely adopted principle for this process is to ensure that similar items are assigned to the same hashcode so that the items with hashcodes similar to a query’s hashcode are likely to be true neighbors. In this work, we abandon this heavily utilized principle and pursue the opposite direction for generating more effective hash functions for kNN tasks. That is, we aim to increase the distance between similar items in the hashcode space, instead of reducing it. Our contribution begins by providing a theoretical analysis of why this revolutionary and seemingly counter-intuitive approach leads to a more accurate identification of kNN items. Our analysis is followed by a proposal for a hashing algorithm that embeds this novel principle. Our empirical studies confirm that a hashing algorithm based on this counter-intuitive idea significantly improves the efficiency and accuracy of state-of-the-art techniques.
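The hashcode-based retrieval pipeline that the abstract refers to can be sketched roughly as follows. This is a minimal illustration, not the authors' Neighbor-Sensitive Hashing algorithm: it assumes plain random-hyperplane hash functions as a stand-in, and every name in it (`make_hyperplanes`, `hashcode`, `approx_knn`, `num_candidates`, and so on) is hypothetical. It only shows the two stages the abstract describes: shortlist items whose hashcodes are close to the query's hashcode, then check how many true kNN items the shortlist contains.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    # Assumption: random Gaussian hyperplanes stand in for learned hash functions.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hashcode(vec, planes):
    # One bit per hyperplane: 1 if vec lies on its non-negative side, else 0.
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(a, b):
    # Number of bit positions in which two hashcodes differ.
    return sum(x != y for x, y in zip(a, b))

def euclidean_sq(a, b):
    # Squared Euclidean distance in the original space.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, data, k):
    # Brute-force kNN: the baseline whose cost hashing tries to avoid.
    return sorted(range(len(data)), key=lambda i: euclidean_sq(data[i], query))[:k]

def approx_knn(query, data, codes, planes, k, num_candidates):
    q_code = hashcode(query, planes)
    # Stage 1: shortlist the items whose hashcodes are closest to the query's.
    shortlist = sorted(range(len(data)), key=lambda i: hamming(codes[i], q_code))
    shortlist = shortlist[:num_candidates]
    # Stage 2: rerank only the shortlist by exact distance.
    shortlist.sort(key=lambda i: euclidean_sq(data[i], query))
    return shortlist[:k]

def recall(approx, exact):
    # Fraction of true kNN items that the approximate result recovered.
    return len(set(approx) & set(exact)) / len(exact)

if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(1000)]
    planes = make_hyperplanes(num_bits=32, dim=16)
    codes = [hashcode(v, planes) for v in data]
    query = [rng.gauss(0.0, 1.0) for _ in range(16)]
    truth = exact_knn(query, data, k=10)
    found = approx_knn(query, data, codes, planes, k=10, num_candidates=100)
    print("recall@10:", recall(found, truth))
```

In this sketch the quality of the result depends entirely on how well Hamming distance between hashcodes separates true neighbors from non-neighbors, which is the property the paper sets out to improve.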

Similar resources

Fast Approximate Nearest-Neighbor Field by Cascaded Spherical Hashing

We present an efficient and fast algorithm for computing approximate nearest neighbor fields between two images. Our method builds on the concept of Coherency-Sensitive Hashing (CSH), but uses a recent hashing scheme, Spherical Hashing (SpH), which is known to be better adapted to the nearest-neighbor problem for natural images. Cascaded Spherical Hashing concatenates different configurations of SpH ...

Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes ...

Two-Stage Hashing for Fast Document Retrieval

This work fulfills sublinear time Nearest Neighbor Search (NNS) in massive-scale document collections. The primary contribution is to propose a two-stage unsupervised hashing framework which harmoniously integrates two state-of-the-art hashing algorithms, Locality Sensitive Hashing (LSH) and Iterative Quantization (ITQ). LSH accounts for neighbor candidate pruning, while ITQ provides an efficient ...

Developing a Good Hash Function for LSH

In the previous lecture we saw how to design locality sensitive hashing (LSH) for Hamming and l1 distance, as a solution to the (c,R)-Near Neighbor problem. Whenever we use LSH for nearest neighbor search under some distance measure, the main task is to come up with a good elementary hashing function. This elementary hash function (H) is then used to create a composite hash function (G) fo...

lsh, Nearest neighbor search in high dimensions

Calculating distance pairs is O(n^2) in memory and time, and finding the nearest neighbor is O(n) in time. Tree indexing techniques like kd-tree [2] were developed to cope with large n; however, their performance quickly breaks down for p > 3 [3]. Locality sensitive hashing (LSH) [3] is a technique for generating hash numbers from high dimensional data, such that nearby points have identical hashe...

Efficient Search in Document Image Collections

This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting (word image matching) may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor sea...

Journal title:
  • PVLDB

Volume 9  Issue 

Pages  -

Publication date 2015